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Abstract 

Background: Due to the difficulty in separating two (paternal and maternal) copies of a chromosome, most published 
human genome sequences only provide genotype information, i.e., the mixed information of the underlying two 
haplotypes. However, phased haplotype information is needed to completely understand complex genetic 
polymorphisms and to increase the power of genome-wide association studies for complex diseases. With the rapid 
development of DNA sequencing technologies, reconstructing a pair of haplotypes from an individual's aligned DNA 
fragments by computer algorithms (i.e., Single Individual Haplotyping) has become a practical haplotyping approach. 

Results: In the paper, we combine two measures "errors corrected" and "fragments cut" and propose a new 
optimization model, called Balanced Optimal Partition (BOP), for single individual haplotyping. The model 
generalizes two existing models, Minimum Error Correction (MEC) and Maximum Fragments Cut (MFC), and could 
be made either model by using some extreme parameter values. To solve the model, we design a heuristic 
dynamic programming algorithm H-BOP. By limiting the number of intermediate solutions at each iteration to an 
appropriately chosen small integer k, H-BOP is able to solve the model efficiently. 

Conclusions: Extensive experimental results on simulated and real data show that when k = 8, H-BOP is generally 
faster and more accurate than a recent state-of-art algorithm ReFHap in haplotype reconstruction. The running time 
of H-BOP is linearly dependent on some of the key parameters controlling the input size and H-BOP scales well to 
large input data. The code of H-BOP is available to the public for free upon request to the corresponding author. 



Background 

Each human somatic cell contains 23 pairs of chromo- 
somes, and there are about 0.5% differences between the 
DNA sequences of two copies of each chromosome [1]. 
The dominant DNA differences are single nucleotide poly- 
morphisms (SNPs). Identification of the combination of 
alleles at the SNP loci on the same chromosome copy, i.e., 
haplotyping, is needed to fully understand the human 
genetic variation patterns and enhance the power of gen- 
ome-wide association studies for complex diseases [2,3]. 
Currently, it is expensive and labor-intensive to separate 
two copies of chromosomes by biological techniques [4], 
and most published human individuals' genomes contain 
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only the mixed information, i.e., genotype information, of 
the underlying two copies of chromosomes [5]. Therefore, 
to reduce the cost, accurate and fast computational haplo- 
typing methods are of urgent importance. 

There have been many computational haplotyping mod- 
els [6-8] and they can be grouped into two main classes: 
haplotype inference and haplotype assembly [6] . Haplotype 
inference is to phase the haplotypes of individuals in a ped- 
igree or a population from their genotypes. Computer algo- 
rithms of haplotype inference have been used in the 
International HapMap Project [9] and the 1000 Genomes 
Project [5] to identify haplotypes. Haplotype assembly is 
also called Single Individual Haplotyping (SIH). SIH assem- 
bles a pair of haplotypes from an individual's aligned DNA 
sequence fragments. With the dramatically dropped cost of 
human whole genome sequencing, more and more human 
individual's DNAs have been sequenced. With mate-pairs 
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sequencing and read length improvements of the next- 
generation sequencing technologies, and with the develop- 
ment of new sequencing technologies, SIH methods have 
been used to build haplotype-resolved genome of human 
beings [10,11]. When there are enough DNA sequence 
fragments that cover two or more consecutive variant loci, 
SIH builds longer and more accurate haplotype blocks 
than haplotype inference does [12]. 

Since Lancia et al., first formalized the SIH problem 
[13], many optimization models and algorithms have 
been introduced to solve the problem [7,14-25]. The 
main models are MEC (minimum error correction) [24], 
MFR (minimum fragment removal), and MSR (minimum 
SNP removal) [13]. Recently, Duitama et al., proposed a 
new model MFC (maximum fragments cut) [16]. Most of 
the models are NP-hard and APX-hard [16,21,26], and 
their exact algorithms run in time exponentially depen- 
dend on at least one input parameter [15,17,19-21]. 
Therefore, a large number of heuristic algorithms have 
been designed to deal with the problem [16,22,23,25]. 
According to [16], one of the most accurate heuristic 
algorithms is HapCUT [25], while ReFHap [16] runs 
much faster than HapCUT without loss of accuracy. In 
this paper, we consider both quality measures "errors 
corrected" and "fragments cut", and propose a new opti- 
mization model, called Balanced Optimal Partition 
(BOP), for the SIH problem. The model generalizes the 
most popular model MEC and the recent model MFC. In 
fact, it could be made either model by setting some para- 
meters to extreme values. To solve the model, we pro- 
pose a dynamic programming algorithm H-BOP. By 
limiting the number of intermediate solutions at each 
iteration to an appropriately chosen small integer k, 
H-BOP is able to solve the model efficiently. The time 
complexity of H-BOP linearly depends on some of the 
key parameters controlling the input size and the algo- 
rithm scales well to large input data. 

Results and discussion 

We use a public available Java package SIH [27] to test the 
performance of H-BOP. The package contains a simulated 
data generator and implements algorithm ReFHap [12,16]. 
The simulated data are generated according to five para- 
meters: number of SNPs (haplotype length) n, number of 
fragments m, average fragment length /, sequencing error 
rate e, and gap rate g. In our experiments, since we only 
consider heterozygous SNPs, for each data set, a haplotype 
hi containing n SNPs is generated randomly first and then 
the other haplotype /z 2 is obtained by flipping each allele 
of h\. For each haplotype, m/2 fragments are randomly 
sampled from the haplotype and their lengths follow a 
normal distribution with mean / and variance 1. Finally for 
each fragment, every allele is flipped with probability e to 
introduce sequencing errors and, except at the first and 



last positions, every allele is deleted with probability g to 
introduce gaps. Given fragments generated as above, the 
average call coverage c is calculated by dividing the total 
number of alleles of the fragments by the haplotype length 
n. Please see [16] for more details. 

Among many algorithms for the SIH problem, Hap- 
CUT and ReFHap are two of the most accurate heuristic 
algorithms [12,16]. Since ReFHap is much faster than 
HapCUT, we only compare our algorithm with ReFHap. 
We implemented our algorithm H-BOP in Java and 
embedded it in the java package SIH and tested the 
accuracies, phased haplotype lengths and running time of 
H-BOP and ReFHap on simulated data and a real data 
set provided by [12]. All tests are carried out on a Win- 
dows 7 64 bit PC (3GHz CPU, 4GB RAM). To measure 
the haplotype reconstruction accuracy of an SIH algo- 
rithm, the hamming distance between the reconstructed 
haplotype pair and the real haplotype pair was previously 
used widely in the literature [14,23]. However, a recent 
study [28] showed it over-penalizes simple switch errors. 
Therefore, as in [12,16], we use switch errors to measure 
the accuracy of an algorithm. A switch error is an incon- 
sistency between the reconstructed haplotype pair and 
the real haplotype pair over two contiguous SNPs. There 
may be some SNP sites where an algorithm is unable to 
determine the alleles of a haplotype. The phased haplo- 
type length is defined as the number of SNP sites where 
the alleles of the reconstructed haplotype pair are deter- 
mined. And the number of switch errors divided by the 
phased haplotype length is called switch error rate. 

If the allele of a fragment /at a SNP site s is known, we 
say/ covers s. When there are no fragments covering two 
consecutive loci, it is not possible to determine the haplo- 
type containing these consecutive loci for all SIH models. 
Therefore, for each test we divide a haplotype into blocks 
according to the input fragments as in [16]. A block corre- 
sponds to a connected component of a graph G = (V, E) 
where V is the set of the SNP sites and there is a edge 
between two SNP sites s 1 and s 2 if and only if there is a 
fragment covers both Sj and s 2 . The switch errors of an 
algorithm are the sum of the switch errors in all blocks. In 
the following simulation tests, the haplotype length n = 
100, the gap rate g = 0.1 and each result is the average 
over 100 repeated experiments if there is no explicit 
specification. 

Parameters of the algorithm H-BOP 

There are two parameters w and k in H-BOP. The para- 
meter w is a weighting factor. H-BOP tries to seek a solu- 
tion with the minimum number of errors corrected when 
w = 0, and a solution with a maximum cut of the 
weighted conflict graph corresponding to the input frag- 
ments [16] when w is set sufficiently large, k is the maxi- 
mum number of intermediate solutions that we will keep 
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at each iteration of H-BOP. When k is large enough, H- 
BOP in fact becomes an exact algorithm. To choose 
appropriate values for w and k, we test H-BOP on differ- 
ent combinations of w and k. 

In Figure 1, the haplotype length n = 100 and the aver- 
age fragment length / = 3. In Figure 1(a), the number of 
fragments m = 140 (c = 4.25) and k = 16. In Figures 1(b) 
and 1(c), m = 210 (c = 6.42) and w = 0.1. Figure 1(a) 
shows that when the sequencing error rate e increases 
from 0.01 to 0.05, the switch errors of H-BOP increases 
accordingly. The switch errors of H-BOP is larger when 
w = 0 than those when w >0 with e unchanged, which is 
obvious when e = 0.05. It indicates that the optimal objec- 
tive minimum errors corrected leads large switch errors 
when sequencing error rate is high. When w = 0.01 and 
0.1, the switch errors of H-BOP are smallest. Figure 1(b) 
shows that when k increases from 1 to 8, the switch errors 
decrease accordingly. When k increases from 8 to 64, 
there are no significant improvements in switch errors of 
H-BOP. When sequencing error rates are high, exact opti- 
mal solutions may incur large switch errors and Figure 1 
(b) indicates that when k increases from 8 to 64, switch 
errors of H-BOP increases accordingly. Figure 1(c) shows 
that the running time of H-BOP increases linearly with k 
when the haplotype length n, number of fragments m and 
average fragment length / remain fixed. In the following 
tests, we set w - 0.1 and k = 8 for H-BOP. 

Simulation results 

We changed the sequencing error rate e, the average frag- 
ment length I and the number of fragments m to generate 
different fragment data sets, and compared the perfor- 
mance of H-BOP and ReFHap. Figure 2 shows that 
H-BOP and ReFHap are both accurate and there are only 
several switch errors in reconstructing haplotypes of 100 
SNPs. The accuracies of both algorithms decrease with the 
increase of e and improves with the increase of m. These 
results are consistent with those in [16], which claim that 



the accuracy of an SIH algorithm increases with decreas- 
ing sequencing error rate and increasing call coverage. In 
a half of the total 48 cases shown in Figure 2, H-BOP pro- 
duces fewer switch errors than ReFHap, especially when 
I = 3 and e > 0.02. In the other half, H-BOP presents a few 
more switch errors than ReFHap only in two cases {i.e., 
Figure 2(a), m = 140, e = 0.005 and Figure 2(b), m = 111, 
e = 0.02). In the remaining 22 cases, H-BOP has the same 
switch errors as ReFHap. 

The phased haplotype lengths of both algorithms gener- 
ally increase with decreasing e and increasing m. Table 1 
shows that when / = 3 and m <351, or when e = 0.05 
(except when / = 10 and m >44), H-BOP is able to phase 
more SNPs than ReFHap. In other cases the phased haplo- 
type lengths of H-BOP and ReFHap are equal. 

We set e = 0.01 and varied the haplotype length n, the 
number of fragments m and the average fragment length 
/ to compare the running time of H-BOP and ReFHap 
(Figure 3). When n = 100, Z = 3 and m increases from 
100 to 400, the running time of H-BOP increases line- 
arly, but the running time of ReFHap increases sharply 
(Figure 3(a)). When m reaches 400, the running time of 
H-BOP is only about 4 seconds, while that of ReFHap is 
about 468 seconds. When n increases from 50 to 200 while 
m = 200 and / = 3, the average call coverage decreases and 
the running time of both algorithms decreases accordingly 
(Figure 3(b)). When m = 200, n = 100 and / increases from 
3 to 9, the running time of ReFHap increases significantly 
while that of H-BOP increases slowly and remains less 
than 5 seconds (Figure 3(c)). When the number of frag- 
ments and the average fragment length increase, the aver- 
age call coverage c increases accordingly. When c is large, 
H-BOP runs much faster than ReFHap. 

Results on real data 

We downloaded a real data set from the SIH website 
[27], which contains the aligned sorted fosmid-based 
NGS DNA sequence fragments and gold-standard 
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Figure 1 Performances of H-BOP with different parameter values. Here, the haplotype length n = 100, the gap rate g = 0.1 and the average 
fragment length / = 3. (a) The number of fragments m = 140 and ft is 16. (b) w = 0.1 and m = 211, (c) w = 0.1, m = 21 1 and e = 0.01. 
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Figure 2 Switch errors of ReFHap and H-BOP with varying e, m and /. The parameters k and w of H-BOP are set as 8 and 0.1, respectively, 
and the haplotype length n is again set as 100. 
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haplotypes of a HapMap trio child, NA12878 [12]. The Since the coverage is low, there are many consecutive 

total heterozygous SNP sites of the data are 1,704,166, heterozygous SNP site pairs not covered by any frag- 

the total fragments are 285,341, the average fragment ments, and hence the 23 pairs of chromosomes are 

length is 18.03, and the average call coverage is 3.02. divided into 17,839 haplotype blocks. Due to the low 
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Table 1 Average phased haplotype lengths of ReFHap and H-BOP 
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The numbers in the table are the average of 100 repeated tests. 



coverage, both H-BOP and ReFHap ran very fast and 
completed the reconstruction of haplotypes for all 23 
pairs of chromosomes in about half a minute. The total 
phased haplotype lengths of H-BOP and ReFHap are 
1,563,741 and 1,562,402 respectively. Compared with 
the gold-standard haplotypes, the total switch errors of 
the haplotypes built by H-BOP and ReFHap are 21,859 
and 21,835 respectively. Though the switch errors of H- 
BOP is larger than that of ReFHap, the switch error 
rates of H-BOP and ReFHap are both 0.014. 

To account for both completeness and quality, Duitama 
et al. [12] proposed an alternative measure QAN50 (qual- 
ity adjusted N50). QAN50 is calculated as follows [12]: 

(1) Break every haplotype block into the longest possi- 
ble segments containing no switch errors. 

(2) Calculate span (in reference base pairs) from the 
first phased SNP to the last phased SNP for every 
segment. 

(3) Adjust each span by multiplying the span with 
phased SNPs ratio (the number of phased SNPs divided 
by the number of total SNPs) inside the segment (to 
correct for un-phased SNPs). 

(4) Sort segments from the largest to the smallest 
adjusted span. 



(5) Traverse the list and count the number of phased 
SNPs. When the count is more than a half of the total 
number of SNPs, the adjust span of the current segment 
is QAN50. 

Clearly, algorithms with larger QAN50 values are 
more desirable. The QAN50 values of H-BOP and 
ReFHap on the above real data set are 114,261.09 and 
113,831.67, respectively. 

Conclusions 

Haplotyping is regarded as one of the hardest challenges 
in personal genome sequencing. Though some single 
molecule sequencing technologies have been developed, 
they are still too expensive and labor-consuming. Compu- 
ter algorithms are widely used to reconstruct haplotypes 
in personal genome sequencing. SIH uses computer algo- 
rithms to build a pair of haplotypes from an individual's 
aligned DNA sequence fragments. There are many differ- 
ent combinatorial optimization models for SIH, among 
which MEC is the most popular and MFC is the most 
recent introduction [16]. In this paper, we combine the 
two quality measures "errors corrected" and "fragments 
cut" used in MEC and MFC and introduce a new model 
BOP. We design a heuristic dynamic programming 
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Figure 3 Running time of ReFHap and H-BOP. (a) The number of fragments m varies with n = 100 and / = 3. (b) The haplotype length n 
varies with m = 200 and / = 3. (c) The average fragment length / varies with m = 200 and n = 100. 
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algorithm H-BOP to solve the model. By setting appropri- 
ate parameters of the algorithm, H-BOP is accurate and 
fast. Extensive simulation experiments show that H-BOP 
is generally faster and more accurate in assembling haplo- 
types than a recent state-of-art algorithm ReFHap. When 
the average fragment length is small and sequencing error 
rate is relatively high, H-BOP is significantly more accu- 
rate than ReFHap. The running time of H-BOP increases 
linearly as the average call coverage c increases, and 
H-BOP runs much faster than ReFHap when c is large. 
The test on a real data set also shows that H-BOP is 
superior to ReFHap considering both completeness and 
quality of the reconstructed haplotypes. 

Methods 

Formulation and problem 

With the input of aligned DNA sequence fragments 
derived from a pair of chromosomes, SIH tries to recon- 
struct a pair of haplotypes of their underlying chromo- 
somes. If there are no sequencing errors and for any two 
consecutive (but not necessarily adjacent) heterozygous 
SNP loci, there is at least one fragment covering both, we 
can easily determine the linkage relationship between two 
consecutive heterozygous SNPs and thus SIH is easy. 
However, sequencing errors are unavoidable which makes 
the problem complicated. In the following, we first intro- 
duce some notations and definitions similar to those in 
[15,16,20,23], and then propose a new optimization model. 

Since we only consider the alleles at SNP loci in the 
SIH problem, the input aligned fragments are encoded 
as an m x n SNP matrix M [15,16,25], where m is the 
number of fragments and n the number of SNP loci. An 
entry at the fth row and the /th column of M is denoted 
as M[i, /]. M[i, /] takes a value from {0, 1, -}, where '0' 
(or '1') encodes the major allele (or the minor allele, 
respectively) in the population and '-' represents an 
unknown allele. As in previous work [16,25], we assume 
that all SNP loci are heterozygous and every fragment 
covers at least two heterozygous SNP loci. If this is not 
the case, a simple preprocessing as in [22] and [29] can 
be used to remove homozygous loci and fragments cov- 
ering only one heterozygous SNP locus. In the remain- 
der of the section, the ith row of M is equivalent to the 
/th fragment and the /' column of M is equivalent to the 
/th SNP locus without explicit specification for briefness. 

Let a, b e {0, 1, -} and define 



c[a, b) = 



1, if a, b ^ —and a ^ b; 
0, otherwise. 



(1) 



Given an m x n SNP matrix M, let the underlying hap- 
lotype pair be U = (H!,H 2 ). The allele at the /th SNP 
locus of Hi (or H 2 ) is denoted as Hi\j] (or H 2 \j], respec- 
tively). Notice that since all SNP loci of interest are 



heterozygous, for any /' e {1, «}, Hi\j] * H 2 \j], i.e. when 
Hi\j\ = 0, H 2 \j] = 1 and when HAj] = I, H 2 [J] = 0. Let /, 
denote the ith fragment (i.e. the ith row of M). c(M[ii, /], 
M[i 2 , /]) = 1 means that and fi 2 conflict at SNP locus / 
{i.e. column /), and that if both fragments come from the 
same chromosome, either M[i lt /] or M[i 2 , /] is a sequen- 
cing error. Similarly, c(M[i, /], H p [j\) - 1 (p = 1 or 2) 
means that/ and H p conflict at SNP locus /, and that iifl 
comes from the chromosome with haplotype H p , M[i, /] 
must be a sequencing error. 

Let s c (H p , f) = n cMi, j], H p \j]) and 

s c {H, fi) = min(s c (H 1 , /;), s c (H 2 , f)) and define 



s c {H, M) = /'))■ 



t=l, ...,m 



(2) 



If the real haplotype pair is %, it is easy to verify that 
the number of sequencing errors in the input fragments 
is at least s c {H, M), which we call the errors corrected 
measure. Based on the principle of parsimony, it is a 
natural optimization objective that to minimize the 
number of errors and hence Minimum Error Correction 
[20,23,24] is the most popular model for SIH. 

Minimum Error Correction (MEC): Given an m x n 
SNP matrix M, find a haplotype pair % such that 
ScifhL, M) is minimized. 

Based on the underlying haplotype pair TL = (H 1; H 2 ), 
it is easy to partition the fragments into two groups Gi, 
G 2 according to the following rule: For each fragment^, 
if s c (H lt fy < s c (H 2 ,fd, add fl to G lt otherwise add f t to 
G 2 . Let V-u denote the partition (G 1( G 2 ) obtained by the 
above rule. A partition V = (Gi, G 2 ) of a fragment set R 
is encoded as a map such that V{f) = 0 (or 1) if the 
fragment f'mR belongs to Gi (or G 2 , respectively). 

Conversely, given a partition V = (Gi, G 2 ), a haplo- 
type pair H = [Hi,H 2 ) can be constructed as follows. 
Let N g>v [j] be the number of fragments of G g whose 
allele at the /th locus is v for g = I, 2 and v = 0, 1. For 
each SNP locus /, if N XA \j\ + N 2 , 0 \j] < N 1J0 \f\ + N 2il \j], 
Hx\j] = 0 and H 2 [j] = 1; otherwise, H^] = 1 and H 2 \j] = 
0. Let HLp denote the haplotype pair obtained by the 
above method. 

Theorem 1 Given an SNP matrix M and a haplotype 
pair H,s c {H, M) > s c (M Pn , M). 

There is another formulation of MEC equivalent to 
the above one: Given an SNP matrix M, find a partition 
T> of the rows in M such that s c (M-p,M) is minimized. 
In the following, s c {K-p,M) is called the errors corrected 
measure ofV- While MEC aims to find a partition such 
that the conflict between the fragments in the same 
group is minimized, Maximum Fragments Cut (MFC) 
[16] aims to find a partition such that the conflict 
between the fragments in G 1 and the fragments in G 2 is 
maximized. 
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Let a, b e {0, 1, -} and define 



d(a, b) 



— I, ii a, b ^ —and a = b; 
1, if a, b ^ —and a ^ b; 
0, otherwise. 



(3) 



Let and /; 2 be two rows of M and define 
dfo, 4) = E , (^[''i' Jl' M t i2 ' A « ^ [16], 

*-^j=l,...,n 

we convert an m x n SNP matrix M into a weighted 
complete graph Q = {V, E), where V is the set of rows in 
M and the weight of the edge between two rows fa and 
fa is d{fi lt fi 2 ). Therefore, a partition V of the rows in M 
corresponds a cut of Q. Given an SNP matrix M and a 
partition "P, the fragments cut measure is defined as: 



Sd{V,M)= J2 Wiuk)- 

V(f h )=l,V(f i2 )=2 



(4) 



Maximum Fragments Cut (MFC) [16]: Given an SNP 
matrix M, find a partition V of the rows of M such that 
SdCP,M) is maximized. 

To take into account both the conflict between the 
two groups and the conflict between the fragments 
within the same group, we introduce a new score com- 
bining the above errors corrected measure and the frag- 
ments cut measure. Given a partition V of the rows of 
M, define the partition score as 



s p {V, M) = s c {M v , M) - ws d {V, M), 



(5) 



where w is a weight factor that is used to adjust the 
weight of the fragments cut measure. In the following 
we propose a new optimization model for the SIH 
problem. 

Balanced Optimal Partition (BOP): Given an SNP 
matrix M, find a partition V of the rows in M such that 
Sp(P,M) is minimized. A solution to BOP of M is 
denoted by BOP(M), i.e. a partition with the minimum 
partition score. 

Note that when w = 0, BOP becomes MEC which has 
been proved NP-hard [24] and APX-hard [26]. There- 
fore, BOP is NP-hard and APX-hard. 

Algorithm 

Given an m x n SNP matrix M, there are 2 1 "' 1 different 
partitions of m rows in M. Therefore, when m is large, 
it is impractical to enumerate all possible partitions and 
choose one with the minimum partition score. To solve 
the BOP model of M efficiently, we propose a dynamic 
programming algorithm in the subsection. We first con- 
sider the first row of M, then the first two rows and so 
on until we have considered all rows of M. 

We need some definitions and notations. Let M [l..i, :] 
denote the SNP matrix consisting of only the first i rows 
of M. The first and last columns at which the z'th row of 



M takes non '-' values are denoted by l(i) and r(i), 
respectively. For a column /, if l{i) < j < r{i), row i spans 
column /. In the following, we assume that all the rows 
of M are sorted such that if i 1 < i 2 , l{i\) < l{i 2 ) or Kh) = 
l{i 2 ) A r(i{) < r(i 2 ). Let R(i) denote the row set contain- 
ing the rows in M[l..i, :] that span column l(i). 

Let V be a partition of a row set R and V a partition 
of a subset R' of R. If for every row i e R', V{i) = T"[i\ 
jy is called the projection of V on R' and T> is called an 
extension of p on 7?. Fix a row i and let V be a partition 
of R(i). J>' is an optimal extension of V, if the following 
conditions hold: (1) V' is an extension of V on the row 
set R = {1, i}; (2) for any possible extension V" of V 
onR, s p {V, M[l..i, :])<s p (V", M[l..i, :]). 

Given a partition V of R(i), let b^V) denote an opti- 
mal extension of V- We call s p {£i(T), M[\..i, :]) the par- 
tition score of V and denote it as Sp{T) for briefness. 

Theorem 2 For an m x n SNP matrix M, let The a 
partition of R(m). S m {V) is a solution to BOP of M if 
the following condition holds: for any possible partition 

vofRd), 5™(P) < s;{v\ 

Consider the submatrix containing only the first row 
of M. Since R(l) contains only one row, i.e. R(l) = {1}, 
there are only two possible partitions V\ and V 2 of R(l) 
(Vi and V2 are in fact equivalent): 'Pi(l) = 0 iff 
7^2(1) = 1- It is easy to verify that the following equal- 
ities hold for i = 1; 2. 



£i(V i )=V i ;s 1 JV i )=0. 



(6) 



After Si{V) and Sp{V) have been calculated for every 
possible partition j> of R(i), we consider the submatrix 
containing the first i + 1 rows of M. Let R c (i, i + 1) = R 
{i) n R(i + 1), and we calculate £ i+1 (V) and sj, +1 {V) for 
every possible partition p' of R(i + 1) according to the 
following method. Let q be the number of the rows in R 
(i) but not in R c (i, i + 1), i.e. q = \R(i)-R c (i, i + 1)|. For a 
partition J>" of R c {i, i + 1), there are 2 q distinct exten- 
sions of V" on R(i). Suppose V m is the one whose parti- 
tion score is the minimum among all 1 q extensions. 
Then £ t {V") and s p {V") can be computed with the fol- 
lowing equations: 



£ i {V")=£ i {V m yjAV") = sUV m ). 



(7) 



Since the rows in M are sorted, it is easy to verify that 
R(i + 1) = R c {i, i + 1) U {i + 1}, and that the number of 
all possible distinct partitions of R(i + 1) is two times 
that of R c (i, i + 1). For each partition J>" of R c (i, i + 1), 
there are two distinct corresponding partitions V\ 
and Vi of R(i + 1): for each I e R c (i, i + 1), 
7Y(0 = 7V(0 = V"{l);Vi'{i) = 0, but V 2 '(i) = 1. Opti- 
mal extensions of V\ and V 2 and their partition scores 
can be calculated with the following equations: 



Xie ef al. BMC Systems Biology 2012, 6(Suppl 2):S8 
http://www.biomedcentral.eom/1752-0509/6/S2/S8 



Page 8 of 10 



Algorithm H-BOP 

input: an m X n SNP matrix M, a weight factor w, an integer k. 

II all columns of M are heterozygous 

// all rows of M span at least two columns 

/ / and the rows of M are sorted 
output: a solution to BOP of M. 

1. initiation: 

1.1. i = LR(l) = {l},q = 1,2 = fc; 
if 2 1 ? < k then z = 2<? ; 
initiate a heap H of size z; 

//a partition is encoded by a binary number 

1.2. forP = 0 2«-l do // Eq.(6) 

£ 1 (P) = P,sj > (P) = 0; 
//.insert^, £i(P),«J(P)); 

2. computation of R c (i,i + 1): 

2.1. 7? c (i,z + 1) = 0,d = 0; 

2.2. for each row / e do 

if Z(i + 1) < r(/) then 

R c (i,i + 1) = fl c (i, i + 1) U {/}, d + +; 

3. projection on /? c (i, i + 1): 

3.1. initiate a heap H' of size 2; 

3.2. for each element L in H do //Eq.(7) 

3.2.1. L.P = the projection of L.P on R c {i,i + 1); 

3.2.2. if H' has a L' such that L'.P = L.P then 

{ if L'.s > L.s then replace L' with L; } 
else //'.insert(L); 

4. extension on R(i + 1): 

4.1. P(i + 1) = Rc(i,i + 1) U {i + 1}; 9 =| + 1) |; 2 = fc; 

4.2. if 2i < k then z = 2i; 

4.3. initiate a heap of size z; 

4.4. for each element L in H' do //Eq.(8) - Eq.(ll) 

P{ =L.P,£ i + 1 (P{) = £(L.P); 
sip +1 (P[) = s p (P) + DeltaEC(i, P[) 

-w x DeltaFC(i,P{); 
i?.insert(Pi , £ i+ x (P( ) , s l p + 1 (Pf )) ; 
P^ = P + 2 d ,^ +1 (P^)=5(P)+2'; 

4 +1 (P2) = s p( P ) + DeltaEC(i, P£) 

-w x DeltaFC(i, P£); 

H.msertiV^ , £ i+1 (P^ ) , 4+ 1 (PJ ) ) ; 

5. next iteration: 

if i < m — 1 then i + +, go to Step 2; 

6. finish: //Eq.(12) 

6.1. let L be the minimum element in H; 
6.3. BOP(M) = £ m (P); 

Figure 4 H-BOP Algorithm. 
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Em{Vi){J) -\v\[11 i = i (8) 

F i+ i{V 2 ){l)-\ v , 2(l)i l = i (9) 

4 +1 PY) = tyP") + W) - W); (10) 

s^pv) = 4cn + W) - ^(p 2 '). (id 

In Equation (10) (or 11), «5 C (W) (or S C (V 2 ')) is the dif- 
ference between the errors corrected measures of 
£m{Vi') (or S i+1 (V 2 ')) and £i(P"); and W) (or i a (7V)) is 
the difference between the fragments cut measures of 
(or£ i+ i(7V)). The values « e (P) and S d (V) are 
calculated by calling the following functions DeltaEC(i, V) 
and DeltaFC(i, V), respectively. 
DeltaEc\i, V) 
{S = 0; 
for /' = l(i + 1),..., r(i + 1) do 
{ if M[i + 1, /] = = '-' then continue; 
N h0 = N 2 , 0 = A/ u = A/2,1 = 0; 
for each row / e R c (i, i + 1) do 
{ if M[l, j] = = '-' then continue; 

v = M[l, j], g = V (l),N &v + +;} 
if N hl + A/2,0 ^ A/ lj0 + A/ 2j i then 
{ j ='<5 - (A/ 1;1 + A/2,0); 1 
else {8 = 6- (N lfl + N 2A) ; } 
v = M[i + 1, ;], g = V(i + 1), A/ sv + +; 
if N h i + A/2,0 ^ A/ 1>0 + A/ 2> i then 
{ S = 8 + (A/i,i + A/2,0); } ' 
else { 5 = <5 + (A/ 1>0 + A/2,!); } 

} 

return 8; } 

DeltaFC{i, V) 
{8 = 0,g 0 = V(i+ 1); 
for /' = l(i + 1), r(i + 1) do 
{ if M[i + 1, /] = = '-' then continue; 
for each row / e R c (i, i + 1) do 
{ if M[l, j] = = '-' then continue; 
v = M[l, j], g = V (/); 
if g == go then continue; 
if v == M[i + 1, j] then 8 - -; 
else 8 + +; } 

} 

return 8; } 

When £ m (V) and ^™(^ ? ) are known for every possible 
partition ~p of i?(w), a solution to BOP of M is easily 
obtained by using the following formula: 

BOP{M) = £ m {V)\s^{V) is minimum. (12) 



Based on the above equations, we can construct an 
exact dynamic programming algorithm for BOP. How- 
ever, the complexity of this exact algorithm increases 
exponentially with the number of rows in R(i), which 
implies that the algorithm is impractical when the call 
coverage is large. Let V* be the projection on R(i) of the 
global optimal partition of M. If the partition score of 
V* is among the k smallest ones of all possible partitions 
of R(i), we only need to compute k partitions of R(i) 
whose partition scores are the smallest in each iteration 
without losing the global optimal partition at the end. 
Based on the above idea, we propose a heuristic algo- 
rithm H-BOP whose pseudo-code is shown in Figure 4. 
In the algorithm, a partition V of a row set is encoded 
by a binary number P, and V(i) is represented by the 
ith bit of P. Therefore, the number set {0, 2 q - 1} 
encodes all possible partitions of a row set containing q 
rows (in fact, there are only 2 ?1 different partitions, and 
in our implement of H-BOP, we use a binary number of 
q - 1 bits to encode a partition to save time and space). 
In each iteration of H-BOP, for each row i we maintain 
a binary max heap H to store the candidate partitions of 
R{i) whose partition scores are among the k smallest. 
The heap H can store at most k elements, and each ele- 
ment L of H contains a partition p, £i(P) and s' p (P), 
which are denoted as L.P, L. £, and L.s respectively. The 
value Sp(P) of each element in the heap is larger than or 
equal to those of its two children. When the number of 
elements of H is smaller than its expected size {i.e. k), 
and a new element L arrives, the element is inserted 
into H and H is adjusted to maintain the max heap 
property. When the number of elements of H reaches k, 
L is compared with the root r of H. If L.s < r.s, the root 
is replaced by L and H is adjusted accordingly; other- 
wise, the new element is discarded. The above operation 
is denoted by H.insert(L). 

The time complexity of H-BOP is Oimkk^), and the 

space complexity is 0(mki + mk), where 
h = max (r(i) - + 1) anc [ k 2 = max [\R[i)\) 

i=l,...,m i=l,...,m 
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